Session 9: Scraping Interactive Web Pages (part 2)

Introduction to Web Scraping and Data Management for Social Scientists

Johannes B. Gruber

2024-08-01

Browser automation

What is Browser Automation?

  • Definition: the process of using software to control web browsers and interact with web elements programmatically
  • Tools Involved: Common tools include Selenium, Puppeteer, and Playwright
  • These tools allow scripts to perform actions like clicking, typing, and navigating through web pages automatically

Common Uses of Browser Automation

  • Testing: Widely used in software development for automated testing of web applications to ensure they perform as expected across different environments and browsers
  • Task Automation: Simplifies repetitive tasks such as form submissions, account setups, or any routine that can be standardized across web interfaces

Browser Automation in Web Scraping

  • Dynamic Content Handling: Essential for scraping websites that load content dynamically with JavaScript. Automation tools can interact with the webpage, wait for content to load, and then scrape the data.
  • Simulation of User Interaction: Can mimic human browsing patterns to interact with elements (like dropdowns, sliders, etc.) that need to be manipulated to access data
  • Avoiding Detection: More sophisticated than basic scraping scripts, browser automation can help mimic human-like interactions, reducing the risk of being detected and blocked by anti-scraping technologies

Example: Google Maps

Goal

  1. Check the commute time programmatically
  2. Extract the distance and time it takes to make a journey

Can we use rvest?

static <- read_html("https://www.google.de/maps/dir/Armadale+St,+Glasgow,+UK/Lilybank+House,+Glasgow,+UK/@55.8626667,-4.2712892,14z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x48884155c8eadf03:0x8f0f8905398fcf2!2m2!1d-4.2163615!2d55.8616765!1m5!1m1!1s0x488845cddf3cffdb:0x7648f9416130bcd5!2m2!1d-4.2904601!2d55.8740368!3e0?entry=ttu")
static |> 
  html_elements(".Fk3sm") |> 
  html_text2()
character(0)

[Screenshot: Google Maps commute]

Let’s use browser automation

The new read_html_live from rvest solves this by emulating a browser:

# loads a real web browser
sess <- read_html_live("https://www.google.de/maps/dir/Armadale+St,+Glasgow,+UK/Lilybank+House,+Glasgow,+UK/@55.8626667,-4.2712892,14z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x48884155c8eadf03:0x8f0f8905398fcf2!2m2!1d-4.2163615!2d55.8616765!1m5!1m1!1s0x488845cddf3cffdb:0x7648f9416130bcd5!2m2!1d-4.2904601!2d55.8740368!3e0?entry=ttu")

You can have a look at the browser with:

sess$view()

Unfortunately, we do not get the content yet. We first have to click on “Accept all”.
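The click can also be scripted instead of done by hand, using the session’s click method. A minimal sketch; the selector below is an assumption and may need adapting to the actual consent dialogue:

```r
# sketch: click the consent button programmatically; the aria-label selector
# is an assumption about Google's consent page and may differ in practice
sess$click("button[aria-label='Accept all']")
```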

Let’s use browser automation

After manipulating the session, you need to read the page into R again:

sess <- read_html_live("https://www.google.de/maps/dir/Armadale+St,+Glasgow,+UK/Lilybank+House,+Glasgow,+UK/@55.8626667,-4.2712892,14z/data=!3m1!4b1!4m14!4m13!1m5!1m1!1s0x48884155c8eadf03:0x8f0f8905398fcf2!2m2!1d-4.2163615!2d55.8616765!1m5!1m1!1s0x488845cddf3cffdb:0x7648f9416130bcd5!2m2!1d-4.2904601!2d55.8740368!3e0?entry=ttu")

Then we can extract information:

# the session behaves like a normal rvest html object
trip <- sess |> 
  html_elements("#section-directions-trip-0")

trip |> 
  html_element("h1") |> 
  html_text2()
[1] "via M8"
trip |> 
  html_element(".fontHeadlineSmall") |> 
  html_text2()
[1] "17 min"
trip |> 
  html_element(".fontBodyMedium") |> 
  html_text2()
[1] "4.2 miles"

Store the cookies

  • Now that we have accepted the cookie banner, a small set of cookies was stored in the browser
  • These are destroyed when we close R, however
  • We can extract and save them for the next run, so no manual intervention is necessary
cookies <- sess$session$Network$getCookies()
saveRDS(cookies, "data/chromote_cookies.rds")

In the next run, you can load the cookies with:

cookies <- readRDS("data/chromote_cookies.rds")
sess$session$Network$setCookies(cookies = cookies$cookies)
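In practice, you can guard the restore step so the same script works on both the first and all later runs. A sketch, reusing the file path from above:

```r
# sketch: restore cookies only if a previous run saved them
cookie_file <- "data/chromote_cookies.rds"
if (file.exists(cookie_file)) {
  cookies <- readRDS(cookie_file)
  sess$session$Network$setCookies(cookies = cookies$cookies)
}
```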

Example: Reddit

Goal

  1. Get Reddit posts
  2. Get the timestamp, up-/downvote score, and post and comment text

Can we use rvest to get post URLs?

html_subreddit <- read_html("https://www.reddit.com/r/wallstreetbets/")

posts <- html_subreddit |> 
  html_elements("a") |> 
  html_attr("href") |> 
  str_subset("/comments/") |> 
  str_replace_all("^/", "https://www.reddit.com/") |> 
  unique()
posts
[1] "https://www.reddit.com/r/wallstreetbets/comments/1ehcsoq/daily_discussion_thread_for_august_01_2024/"     
[2] "https://www.reddit.com/r/wallstreetbets/comments/1ec3ehr/most_anticipated_earnings_releases_for_the_week/"
[3] "https://www.reddit.com/r/wallstreetbets/comments/1eh58yz/05m_goal_reached/"                               

This does not look too bad!

Although we only get 3 posts 🤔

Can we use rvest to get posts?

html_post <- read_html("https://www.reddit.com/r/wallstreetbets/comments/1ehcsoq/daily_discussion_thread_for_august_01_2024/")
post_data <- html_post |> 
  html_elements("shreddit-post")

post_data |> 
  html_attr("created-timestamp") |> 
  lubridate::as_datetime()
[1] "2024-08-01 09:57:12 UTC"
post_data |> 
  html_attr("id")
[1] "t3_1ehcsoq"
post_data |> 
  html_attr("subreddit-id")
[1] "t5_2th52"
post_data |> 
  html_attr("score")
[1] "23"
post_data |> 
  html_attr("comment-count")
[1] "489"

We actually can!
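The attribute calls above can be bundled into one tidy table. A sketch using only the attributes already extracted on this slide:

```r
library(rvest)
library(tibble)

# sketch: collect the shreddit-post attributes into a one-row tibble
post_info <- function(html_post) {
  p <- html_elements(html_post, "shreddit-post")
  tibble(
    id       = html_attr(p, "id"),
    created  = lubridate::as_datetime(html_attr(p, "created-timestamp")),
    score    = as.integer(html_attr(p, "score")),
    comments = as.integer(html_attr(p, "comment-count"))
  )
}
```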

Can we use rvest to get comments?

comments <- html_post |> 
  html_elements("shreddit-comment")
comments
{xml_nodeset (0)}

😑

How about read_html_live?

html_post_live <- read_html_live("https://www.reddit.com/r/wallstreetbets/comments/1ehcsoq/daily_discussion_thread_for_august_01_2024/")
comments <- html_post_live |> 
  html_elements("shreddit-comment")
comments
{xml_nodeset (10)}
 [1] <shreddit-comment author="Elonmuskishuman" thingid="t1_lfyjsyy" depth="0 ...
 [2] <shreddit-comment author="Dolo12345" thingid="t1_lfyjsvi" depth="0" perm ...
 [3] <shreddit-comment author="PlantAstronaut" thingid="t1_lfyjpva" depth="0" ...
 [4] <shreddit-comment author="KaleidoscopeNo828" thingid="t1_lfyjpi2" depth= ...
 [5] <shreddit-comment author="Frozen_Shades" thingid="t1_lfyjopm" depth="0"  ...
 [6] <shreddit-comment author="Zurkarak" thingid="t1_lfyjnav" depth="0" perma ...
 [7] <shreddit-comment author="mistaowen" thingid="t1_lfyjmxp" depth="0" perm ...
 [8] <shreddit-comment author="Rebornzzz" thingid="t1_lfyjmrs" depth="0" perm ...
 [9] <shreddit-comment author="yakuzaDotes" thingid="t1_lfyjlmg" depth="0" pe ...
[10] <shreddit-comment author="HealthyFly1561" thingid="t1_lfyjk77" depth="0" ...

😁

But again, something is missing 🤔

Interacting with the session

html_post_live$view()

Scrolling down as far as possible:

html_post_live$scroll_to(top = 10 ^ 5)

This triggers new content to be loaded:

comments <- html_post_live |> 
  html_elements("shreddit-comment")
comments
{xml_nodeset (10)}
 [1] <shreddit-comment author="Elonmuskishuman" thingid="t1_lfyjsyy" depth="0 ...
 [2] <shreddit-comment author="Dolo12345" thingid="t1_lfyjsvi" depth="0" perm ...
 [3] <shreddit-comment author="PlantAstronaut" thingid="t1_lfyjpva" depth="0" ...
 [4] <shreddit-comment author="KaleidoscopeNo828" thingid="t1_lfyjpi2" depth= ...
 [5] <shreddit-comment author="Frozen_Shades" thingid="t1_lfyjopm" depth="0"  ...
 [6] <shreddit-comment author="Zurkarak" thingid="t1_lfyjnav" depth="0" perma ...
 [7] <shreddit-comment author="mistaowen" thingid="t1_lfyjmxp" depth="0" perm ...
 [8] <shreddit-comment author="Rebornzzz" thingid="t1_lfyjmrs" depth="0" perm ...
 [9] <shreddit-comment author="yakuzaDotes" thingid="t1_lfyjlmg" depth="0" pe ...
[10] <shreddit-comment author="HealthyFly1561" thingid="t1_lfyjk77" depth="0" ...

Automate the scrolling

last_y <- -1
#scroll as far as possible
while (html_post_live$get_scroll_position()$y > last_y) {
  last_y <- html_post_live$get_scroll_position()$y
  html_post_live$scroll_to(top = 10 ^ 5)
  load_more <- html_post_live |> 
    html_elements("[noun=\"load_more_comments\"]") |> 
    length()
  if (load_more > 0) {
    html_post_live$click("[noun=\"load_more_comments\"]")
    html_post_live$scroll_to(top = 10 ^ 5)
  }
  Sys.sleep(1 * runif(1, 1, 3))
}
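Once all comments are loaded, the attributes visible in the node list above (author, thingid, depth) can be collected into a tidy table. A sketch:

```r
library(rvest)

# sketch: turn the loaded <shreddit-comment> elements into a tidy table
comment_nodes <- html_post_live |> 
  html_elements("shreddit-comment")
comments_df <- tibble::tibble(
  author = html_attr(comment_nodes, "author"),
  id     = html_attr(comment_nodes, "thingid"),
  depth  = as.integer(html_attr(comment_nodes, "depth"))
)
```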

Alternative: Playwright

Introducing Playwright

  • Tool for web testing
  • Testing a website and scraping it is actually quite similar
  • It essentially uses a special version of a web browser that can be controlled through code from different languages
  • Unfortunately, there is no R package that wraps the API yet (but there is an R package that wraps the Python package)
  • Alternatives you might have heard of: Selenium and Puppeteer

First, install it

We want to use playwrightr, an R package that controls the Python package for Playwright. So we need three pieces:

  1. The R package: install it with remotes::install_github("JBGruber/playwrightr")
  2. The Python package: we install this into a virtual environment using reticulate
  3. The Playwright executable, which consists of a modified version of Chrome that can be remote controlled

All three steps are done when you run the code below:

if (!rlang::is_installed("playwrightr")) remotes::install_github("JBGruber/playwrightr")
if (!reticulate::virtualenv_exists("r-playwright")) {
  reticulate::virtualenv_install("r-playwright", packages = "playwright")
  playwright_bin <- reticulate::virtualenv_python("r-playwright") |> 
    stringr::str_replace("python", "playwright")
  system(paste(playwright_bin, "install chromium"))
}
reticulate::use_virtualenv("r-playwright")

Control Playwright from R with an experimental package

I did not write the package, but made some changes to make it easier to use.

To get started, we first initialize the underlying Python package and then launch Chromium:

library(reticulate)
library(playwrightr)
pw_init()
chrome <- browser_launch(
  browser = "chromium", 
  headless = !interactive(), 
  # make sure data like cookies are stored between sessions
  user_data_dir = "user_data_dir/"
)

Now we can navigate to a page:

page <- new_page(chrome)
goto(page, "https://www.facebook.com/groups/911542605899621")
# A tibble: 1 × 2
  browser_id    page_id      
  <chr>         <chr>        
1 id_LxhKcPPiCt id_SC9CpqS1GV

When you are in Europe, the page asks for consent to save cookies in your browser:

Getting the page content

Okay, we now see the content. But what about collecting it? We can use several different get_* functions to identify specific elements. But we can also simply get the entire HTML content:

html <- get_content(page)
html
{html_document}
<html id="facebook" class="_9dls __fb-light-mode" lang="de" dir="ltr">
[1] <head>\n<meta http-equiv="Content-Type" content="text/html; charset=UTF-8 ...
[2] <body class="_6s5d _71pn system-fonts--body">\n<script type="application/ ...

Conveniently, this is already an rvest object, so we can use our familiar tools to get the links of the visible posts. The page uses a role attribute, which I employ here, and I know that links to posts contain “posts”:

post_links <- html |> 
  html_elements("[role=\"link\"]") |> 
  html_attr("href") |> 
  str_subset("posts")
head(post_links)
character(0)

Collecting Post content

Now we can visit the page of one of these posts and collect the content from it:

post1 <- new_page(chrome)
# go to the page
goto(post1, post_links[1])
post1_html <- get_content(post1)

We can check the content we collected locally:

check_in_browser <- function(html) {
  tmp <- tempfile(fileext = ".html")
  writeLines(as.character(html), tmp)
  browseURL(tmp)
}
check_in_browser(post1_html)

Scraping the content: failure

The site uses a lot of obscure classes, making it almost impossible to get the content:

author <- post1_html |> 
  html_elements("h2") |> 
  html_text2() |> 
  head(1)

text <- post1_html |> 
  html_element("[id=\":r1j:\"]") |> 
  html_text2()

tibble(author, text)

Nevertheless, some success…

So now that we have the content, we can close the page:

close_page(post1)

What is cool about Playwright

Playwright ships with a code generator that records your actions in the browser and turns them into a script:

playwright_bin <- reticulate::virtualenv_python("r-playwright") |> 
  stringr::str_replace("python", "playwright")
system(paste(playwright_bin, "codegen"))

This can produce Python scripts

from playwright.sync_api import Playwright, sync_playwright
def run(playwright: Playwright) -> None:
    browser = playwright.chromium.launch(headless=False)
    context = browser.new_context()
    page = context.new_page()
    page.goto("https://www.google.com/")
    page.get_by_role("button", name="Accept all").click()
    page.get_by_label("Search", exact=True).click()
    page.get_by_label("Search", exact=True).fill("amsterdam bijstand")
    page.get_by_role("link", name="Bijstandsuitkering Gemeente").click()

    # ---------------------
    context.close()
    browser.close()


with sync_playwright() as playwright:
    run(playwright)

Summary: Browser Automation

What is it

  • remote control a browser to perform pre-defined steps
  • several available tools:
    • native R: chromote, used in read_html_live
    • native Python: Playwright, can be used from R with playwrightr (experimental)
    • native Java: Selenium, can be used from R with RSelenium (buggy and outdated)
    • native JavaScript: Puppeteer, no R bindings

What are they good for?

  • Get content from pages which you can’t otherwise access
  • Load more content through automated scrolling on dynamic pages
  • Automate tasks like downloading files

Issues

  • Companies have mechanisms to counter scraping:
    • rate limiting requests per second/minute/day and user/IP (Twitter)
    • captchas (can be solved but quite complex)
  • Won’t get you around very obscure HTML code (Facebook)
  • Quite heavy and very slow compared to requests

Wrap Up

Save some information about the session for reproducibility.

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: x86_64-pc-linux-gnu
Running under: EndeavourOS

Matrix products: default
BLAS:   /usr/lib/libblas.so.3.12.0 
LAPACK: /usr/lib/liblapack.so.3.12.0

locale:
 [1] LC_CTYPE=en_GB.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=en_GB.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=en_GB.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

time zone: Europe/London
tzcode source: system (glibc)

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
 [1] playwrightr_0.0.0.9000 reticulate_1.36.1      rvest_1.0.4           
 [4] httr2_1.0.1            lubridate_1.9.3        forcats_1.0.0         
 [7] stringr_1.5.1          dplyr_1.1.4            purrr_1.0.2           
[10] readr_2.1.5            tidyr_1.3.1            tibble_3.2.1          
[13] ggplot2_3.5.1          tidyverse_2.0.0       

loaded via a namespace (and not attached):
 [1] rappdirs_0.3.3    utf8_1.2.4        generics_0.1.3    xml2_1.3.6       
 [5] lattice_0.22-6    stringi_1.8.4     hms_1.1.3         digest_0.6.35    
 [9] magrittr_2.0.3    evaluate_0.23     grid_4.4.1        timechange_0.3.0 
[13] fastmap_1.1.1     Matrix_1.7-0      jsonlite_1.8.8    processx_3.8.4   
[17] chromote_0.2.0    ps_1.7.7          promises_1.3.0    httr_1.4.7       
[21] selectr_0.4-2     fansi_1.0.6       scales_1.3.0      cli_3.6.3        
[25] rlang_1.1.4       munsell_0.5.1     withr_3.0.0       yaml_2.3.8       
[29] tools_4.4.1       tzdb_0.4.0        colorspace_2.1-0  curl_5.2.1       
[33] png_0.1-8         vctrs_0.6.5       R6_2.5.1          lifecycle_1.0.4  
[37] pkgconfig_2.0.3   pillar_1.9.0      later_1.3.2       gtable_0.3.5     
[41] glue_1.7.0        Rcpp_1.0.12       xfun_0.44         tidyselect_1.2.1 
[45] rstudioapi_0.16.0 knitr_1.46        websocket_1.4.1   htmltools_0.5.8.1
[49] rmarkdown_2.26    compiler_4.4.1